Probability & Statistics
Q. Why are Probability and Statistics important ?
Ans :
Probability (means Chance) : It is a number that indicates how likely an event is to occur. It is expressed as a number in the range from 0 to 1, or, using percentage notation, in the range from 0% to 100%.
Statistics : Statistics means studying, collecting, analyzing, interpreting, and organizing data.
Q. What is a Random Variable ?
Ans : A random variable is a variable whose value is a numerical outcome of a random phenomenon. There are two kinds of random variables :-
1. Discrete Random Variable : takes one value from a discrete (countable or finite) set.
A discrete random variable X has a countable number of possible values.
Example: Let X represent the sum of two dice.
To graph the probability distribution of a discrete random variable, we construct a probability histogram.
2. Continuous Random Variable : takes all values in a given interval of numbers.
- The probability distribution of a continuous random variable is shown by a density curve.
- The probability that a continuous random variable X is exactly equal to a number is zero.
Q. What is Outlier?
Ans : An outlier is a data point that lies far outside the typical range of values in a dataset.
Note : Mean & Variance can be corrupted by a single outlier, so we use the Median, which is robust to outliers.
Q. What is difference between Population & Sample ?
Ans :
Population : Whole data
Sample : Small data drawn from Population. A subset of Population.
Population mean is denoted by µ (mu).
Sample mean is denoted by x̄ (x bar).
Note : As the sample size increases, the sample mean approaches the population mean.
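This convergence can be sketched with a quick NumPy simulation; the synthetic "population" and sample sizes below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# A synthetic "population" of 100,000 values with mean ~50 (illustrative)
population = rng.normal(loc=50, scale=10, size=100_000)

# Sample means get closer to the population mean as the sample size grows
errors = []
for n in (10, 100, 10_000):
    sample = rng.choice(population, size=n, replace=False)
    errors.append(abs(sample.mean() - population.mean()))
    print(f"n={n:>6}: |sample mean - population mean| = {errors[-1]:.3f}")
```

Re-running with different seeds changes the exact numbers, but the error for the largest sample is consistently the smallest.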
Gaussian Normal Distribution and its PDF(Probability Density Function) :
Gaussian distribution is a type of continuous probability distribution for a real-valued random variable.
The mean of the distribution determines the location of the center of the graph, the standard deviation determines the height and width of the graph and the total area under the normal curve is equal to 1.
Q. Why is the Normal distribution so important ?
Ans : Because many naturally-occurring phenomena tend to approximate the normal distribution,
and distributions are simple models of natural behaviour. Using the properties of a distribution, we can draw many conclusions about the data.
- Important Points :
Normal distributions are symmetrical, but not all symmetrical distributions are normal.
All normal distributions can be described by just two parameters: the mean and the standard deviation.
mean = median = mode
Variance is the measure of spread. For the Standard Normal Distribution, Variance = 1 and Mean = 0.
Kurtosis measures the thickness of the tails of a distribution relative to the tails of a normal distribution. The normal distribution has a kurtosis equal to 3.
Skewness measures the degree of symmetry of a distribution. The normal distribution is symmetric and has a skewness of zero
- leptokurtic = kurtosis greater than 3.0 (fatter tails)
- platykurtic = kurtosis less than 3.0 (thinner tails)
- The Normal distribution arises naturally via the Central Limit Theorem .
Q. What is Central Limit Theorem ?
Ans : The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed.
This will hold true regardless of whether the source population is normal or skewed, provided the sample size is sufficiently large (usually n > 30).
CDF (Cumulative Distribution Function) of the Gaussian/Normal distribution :
The CDF always lies between 0 and 1.
The CDF of the Normal Distribution is an 'S'-shaped curve.
The Normal Distribution follows the 68–95–99.7 rule.
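The 68–95–99.7 rule can be verified directly from the normal CDF; a small sketch using scipy.stats.norm:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean, k = 1, 2, 3
coverage = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k} sigma: {p:.4f}")  # ~0.6827, ~0.9545, ~0.9973
```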
Symmetric distribution, Skewness and Kurtosis :
Q. What is Skewness ?
Ans : Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
The skewness value can be positive, zero, negative, or undefined. There are two types of Skewness :-
Negative Skewness : The left tail is longer.
A left-skewed distribution usually appears as a right-leaning curve; the mass of the distribution is concentrated on the right of the figure.
Positive Skewness : The right tail is longer.
A right-skewed distribution usually appears as a left-leaning curve; the mass of the distribution is concentrated on the left of the figure.
Q. What is relation of Mean, Median & Mode with Skewness ?
Ans : There is no exact formula linking skewness to the mean and median, but for unimodal, moderately skewed distributions the empirical (Pearson) relationship is:
- (Mode - Mean) = 3*(Median - Mean)
Q. What is Kurtosis?
Ans : Kurtosis and Excess-Kurtosis are two different terms and create confusion.
Excess_Kurtosis = Kurtosis - 3
Kurtosis of Normal distribution = 3
Excess-Kurtosis of Normal distribution = 0
- Platykurtic : excess-kurtosis < 0
- Mesokurtic : excess-kurtosis = 0
- Leptokurtic : excess-kurtosis > 0
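These definitions can be checked on a simulated normal sample; note that scipy.stats.kurtosis returns excess kurtosis by default:

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(42)
data = rng.normal(size=200_000)

ex_kurt = kurtosis(data)                 # excess kurtosis, ~0 for normal data
raw_kurt = kurtosis(data, fisher=False)  # "plain" kurtosis, ~3 for normal data
print(round(ex_kurt, 3), round(raw_kurt, 3), round(skew(data), 3))
```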
Summary :
Mean : tells about the location of distribution
Variance : tells about the spread of distribution
Skewness : tells how dissimilar the distribution is from a symmetric distribution
Kurtosis : tells about the tail thickness of the distribution
- Standard normal variate (Z) and standardization :-
Q. What is Standard normal variate ?
Ans : A standard normal variate is a normal variate with mean µ=0 and standard deviation σ =1 with a probability density function f(z).
It is denoted by 'Z' .
- Standard normal variables play a major role in regression analysis, the analysis of variance and time series analysis.
Q. What is standardization ?
Ans : It is a transformation method to change a normal distribution into the Standard Normal Distribution.
Not only a Normal Random Variable : given any random variable with a Mean & Standard deviation, we can standardize it using the formula below (the result is a Standard Normal Variate when X is Normal)
Z = (X-µ)/σ
where X = a random variable which follows a Normal Distribution
Z = Standard Normal Variate
µ = Mean
σ = Standard Deviation
Q. Why do we need Standardization ?
Ans : After Standardization, variables are on a common scale (mean 0, std 1), so we can compare them and draw useful insights about the data.
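A minimal sketch of standardization with the formula above; the "heights" data here is made-up illustrative data:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=170, scale=12, size=50_000)  # hypothetical heights in cm

z = (x - x.mean()) / x.std()  # Z = (X - mu) / sigma
print(round(z.mean(), 6), round(z.std(), 6))  # mean ~0, std ~1
```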
Kernel density estimation using Gaussian Kernels :-
Q. What is KDE ?
Ans : Kernel Density Estimation is an unsupervised learning technique that helps estimate the PDF of a random variable in a non-parametric way.
- It is related to a histogram, but with data smoothing.
- KDE converts a histogram into a PDF (probability density function).
- This technique allows us to create a smooth curve from a set of random data.
- It can also be used to generate points that look like they came from a certain dataset - this behavior can power simple simulations, where simulated objects are modeled off of real data.
Q. Why is KDE required ?
Ans : Histograms are not smooth; they depend on the width of the bins and on the bin endpoints. KDE reduces these problems by providing smoother curves.
A Gaussian kernel (a small bell curve) is centred on each data point : its mean is that data point, and its variance is also called the bandwidth. Summing the kernels gives the density estimate.
Q. What should the variance (bandwidth) of the Gaussian kernels be ?
Ans :
If the bandwidth is too small : the estimate will be jagged.
If the bandwidth is too large : the estimate will be flat (over-smoothed).
In seaborn, there are some nice heuristics to find the right bandwidth.
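The bandwidth effect can be sketched with scipy.stats.gaussian_kde, whose bw_method factor scales the kernel width (the specific values here are illustrative):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
data = rng.normal(size=2_000)
grid = np.linspace(-4, 4, 400)

densities = {}
for bw in (0.05, 0.3, 1.0):  # small -> jagged, large -> over-smoothed
    densities[bw] = gaussian_kde(data, bw_method=bw)(grid)
    area = (densities[bw] * (grid[1] - grid[0])).sum()  # a PDF integrates to ~1
    print(f"bw={bw}: peak={densities[bw].max():.3f}, area={area:.3f}")
```

The small-bandwidth estimate spikes above the true peak (~0.4), while the large-bandwidth one flattens below it; every estimate still integrates to roughly 1.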
Sampling Distribution & Central Limit Theorem :
Q. What is a Sampling Distribution ?
Ans :
The distribution of sample means is called the Sampling Distribution.
Q. Are an infinite Mean and infinite Variance possible ? If yes, how can a distribution have an infinite mean and variance ?
Ans : The best example is the Pareto Distribution (for certain values of its shape parameter).
Q. Can you state the Central Limit Theorem in one line ?
Ans : Yes;
CLT states that :-
"The Sampling Distribution of the Sample Mean follows a Normal Distribution (approximately)".
Q. What is this term "Sampling Distribution of Sample Mean" ? What is its meaning ?
Ans : "Sampling Distribution of Sample Mean" = Sampling + Distribution of Sample Mean.
Sampling : the process by which we take samples from the population.
Distribution of Sample Mean : plotting the mean of each sample.
Q. Why is the Central Limit Theorem beautiful ?
Ans : Because by looking at only samples of data points, we can draw conclusions about the whole population.
The field of statistics is based upon the fact that it is rarely feasible or practical to collect all of the data from an entire population. Instead, we can gather a subset of data from a population, and then use statistics for that sample to draw conclusions about the population.
To estimate the population mean and population variance of any distribution, we just need to know that they are finite & well-defined.
We can estimate them with a simple sampling exercise, computing the "Sampling Distribution of Sample Means", which will follow a Normal Distribution.
So we can recover the population Mean as the Mean of the Distribution of Sample Means, and
the population Variance from the Variance of the Distribution of Sample Means (multiplied by the sample size n, since σx̅² = σ²/n).
- By observation : when the Sample Size (n) > 30, the CLT is applicable for any distribution (with finite mean & variance)
- ref : https://www.minitab.com/content/dam/www/en/uploadedfiles/content/academic/CentralLimitTheorem.pdf
- ref : https://stats.stackexchange.com/questions/541379/testing-the-central-limit-theorem-with-the-shapiro-wilk-test-on-dice-rolling-sim
Q. Is Central Limit Theorem applicable for every distribution ?
Ans : No; the Central Limit Theorem doesn't work with every distribution. This is due to one sneaky fact : sample means are clustered around the mean of the underlying distribution, if it exists. But how can a distribution have no mean? Well, one common distribution that has no finite mean in special cases is the Pareto distribution.
- Note : CLT is valid only for those distributions which have a Finite Mean & Finite Variance
- Research Area : applying the CLT to distributions which have infinite Mean & Variance
Q. What is Central Limit Theorem ? Define Formally.
Ans :
The formal definition of central limit theorem states that:
" For a population with mean (µ) and standard deviation (σ), if we take sufficiently large random samples from the population with replacement, then the distribution of the sample means (also known as sampling distribution of means) will approximate to the normal distribution.
This will hold true regardless of the distribution of the source population whether it is normal, skewed, uniform or completely random, provided the sample size (n) is sufficiently large (typically n > 30). When the population is normally distributed, the theorem holds true even for smaller sample size i.e. n < 30. "
- Properties of CLT :
μx̅ ≈ μ : the Mean of sample means (μx̅) will be approximately equal to the population mean (μ).
σx̅ = σ/√n , As we increase the sample size (n), the standard deviation of the sampling distribution of means (σx̅ = σ/√n) will become smaller.
where,
population mean : μ
population standard deviation : σ
sample mean : x̅i
mean of all the sample means (x̅1, x̅2, x̅3,…, and x̅m) : μx̅
standard deviation of all 'm' the sample means (x̅1, x̅2, x̅3,…, and x̅m) : σx̅
size of each sample : n
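A small simulation sketch of these properties, using a skewed (exponential) population; the sizes n and m are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 20_000  # sample size and number of samples
# Exponential(1) population: mean mu = 1, std sigma = 1, clearly skewed
sample_means = rng.exponential(scale=1.0, size=(m, n)).mean(axis=1)

print(round(sample_means.mean(), 3))  # ~ mu = 1
print(round(sample_means.std(), 3))   # ~ sigma/sqrt(n) ~ 0.141
```

Despite the skewed population, the sample means centre on μ with spread σ/√n, exactly as the two properties above state.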
Q. How to test if a random variable is normally distributed or not ?
Ans : There are many statistical tests (like the KS test and the AD test, which is more powerful) to check whether a random variable is normally distributed, but the Quantile-Quantile plot (Q-Q plot) is the simplest graphical method.
Q. How to plot Q-Q plot ? How to read Q-Q plot ?
Ans :
Random Variable X = x1, x2, x3, ... , x500
Step 1 : Sort the data points and compute their percentiles.
Step 2 : Compute the same percentiles of a Standard Normal random variable {called the theoretical Quantiles}.
Step 3 : Plot the theoretical Quantiles on the X-axis vs the quantiles of Random Variable X on the Y-axis.
- Note : If all the points lie on a straight line, then Random Variable X and the theoretical distribution are similar; so if the theoretical distribution is Normal, X will also be (approximately) Normally Distributed.
import numpy as np
import pylab
import scipy.stats as stats

# N(0,1): generate size = 1000 random observations from the Normal Distribution.
# Here loc=0 means mean = 0 and scale=1 means Std. Deviation = 1
std_normal = np.random.normal(loc=0, scale=1, size=1000)

# 0th to 100th percentiles of std_normal
for i in range(0, 101):
    print(i, "th percentile = ", np.percentile(std_normal, i))
0 th percentile = -3.499  ...  25 th percentile = -0.676  ...  50 th percentile = 0.045  ...  75 th percentile = 0.718  ...  100 th percentile = 3.404  (output truncated)
# Generate 100 samples from N(5, 20), i.e. loc=5 (mean) and scale=20 (Std. Deviation)
measurement = np.random.normal(loc=5, scale=20, size=100)
# Try size = 50, 1000 : as the sample size increases, more & more points fall on the straight line
# Limitation of Q-Q plot : hard to interpret any conclusion when the sample size is small
# Q-Q plot : compare with a standard normal variable
stats.probplot(measurement, dist="norm", plot=pylab)
pylab.show()
# Since the points are roughly collinear, both have the same type of distribution.
# Generate 100 samples from a uniform distribution
measurement = np.random.uniform(low=-1, high=1, size=100)
# try size = 50, 1000
stats.probplot(measurement, dist="norm", plot=pylab)
pylab.show()
# Here we observe that, with fewer data points, it is difficult to interpret
# whether the variable is normal or not
- Advantages of Q-Q Plot :
- To check whether a random variable is Normally Distributed or not.
- To check whether random variables X and Y have the same type of distribution.
Q. How/Where/When to use Distributions in the Real World ?
Ans : All these Probability & Statistical tools are used in Exploratory Data Analysis.
Data Analysis is nothing but answering questions about data.
All these distributions help answer Exploratory Data Analysis questions about the data.
The Gaussian distribution is a theoretical model of the distribution of data observed in many natural phenomena. We can use it to gain insights easily.
Chebyshev’s inequality : (It is valid for any distribution)
If X is any random variable with finite mean µ and non-zero finite standard deviation σ, then
P(|X - µ| >= k*σ) <= 1/k²  for any k > 0
Note : In other words, at least (1 - 1/k²) of the values lie within k standard deviations of the mean.
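A quick empirical check of Chebyshev's inequality, P(|X - µ| >= kσ) <= 1/k², on skewed (exponential) data; the distribution and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=100_000)  # any finite-mean/variance distribution
mu, sigma = x.mean(), x.std()

for k in (2, 3):
    observed = np.mean(np.abs(x - mu) >= k * sigma)
    bound = 1 / k**2
    print(f"k={k}: observed tail {observed:.4f} <= bound {bound:.4f}")
    assert observed <= bound  # Chebyshev guarantees this for any distribution
```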
Discrete and Continuous Uniform distributions :
1. Discrete Uniform Distribution :
If a random variable is discrete and it follows the Uniform distribution, then it is called a Discrete Uniform Distribution.
- PMF : the Probability Mass Function is for a Discrete random variable
- In the Uniform Distribution all the events are equi-probable.
- It is a symmetric distribution
- It is not skewed, so its skewness is 0
- Ex : throwing a die, tossing a coin
2. Continuous Uniform Distribution :
If a random variable is continuous and it follows the Uniform distribution, then it is called a Continuous Uniform Distribution.
- PDF : the Probability Density Function is for a Continuous random variable
# Example : random.random() generates a random data point between (0,1) using the Uniform Distribution
import random
print(random.random())
0.8029317742016568
Q. How to randomly sample data points ?
Ans : Using the Uniform Distribution, we generate random numbers that all have equal probability.
- Most random number generators follow the Uniform Distribution.
# load IRIS dataset with 150 points
from sklearn import datasets
iris = datasets.load_iris()
d = iris.data
d.shape
(150, 4)
# Sample ~30 data points randomly from the 150-point dataset
n = 150
m = 30
p = m / n
sample_data = []
for i in range(0, n):
    if random.random() <= p:
        sample_data.append(d[i])
len(sample_data)  # size of the random sample is roughly 30, not exactly 30. Try it out!
29
Bernoulli Distribution :
- It is discrete
- It is used when we have only two outcomes. Ex:- tossing a coin
Binomial Distribution :
- It is discrete
- It is the sum of n independent Bernoulli trials
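The "sum of n Bernoulli trials" view can be sketched with NumPy; the parameters n, p and the number of simulated trials are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, trials = 10, 0.5, 100_000

# Summing n Bernoulli(p) variables gives a Binomial(n, p) variable
bernoulli_sums = rng.binomial(1, p, size=(trials, n)).sum(axis=1)
binomial = rng.binomial(n, p, size=trials)

print(round(bernoulli_sums.mean(), 2), round(binomial.mean(), 2))  # both ~ n*p = 5
print(round(binomial.var(), 2))                                    # ~ n*p*(1-p) = 2.5
```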
Log-Normal Distribution :
A random variable X is said to follow a Log-Normal Distribution if log(X) follows a Normal Distribution.
Also called the Galton distribution
Continuous Probability Distribution
Not Symmetric
Just like the Normal Distribution, it also has two parameters : Mean & Variance (of log(X))
Examples of the Log-Normal Distribution in Real Life :
The length of comments posted in Internet discussion forums follows a log-normal distribution. (Most of the comments are short; only some of the comments are very long.)
Measures of the size of living tissue (length, skin area, weight).
In economics, there is evidence that the income of 97%–99% of the population is distributed log-normally. (The distribution of higher-income individuals follows a Pareto distribution.)
In scientometrics, the number of citations to journal articles and patents follows a discrete log-normal distribution.
- Fun Part 😀 : Love ❤ Relationships💑 follow Log-Normal Distribution.
Q. How to know whether a Random Variable follows a Log-Normal Distribution ?
Ans : Take the log of all data points and then draw a Q-Q plot against the Normal distribution.
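Besides the Q-Q plot, a normality test on log(X) gives a numeric check; a sketch using scipy.stats.normaltest on simulated log-normal data (parameters are illustrative):

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(9)
x = rng.lognormal(mean=0.0, sigma=0.5, size=5_000)  # simulated log-normal data

_, p_raw = normaltest(x)          # raw data: normality should be strongly rejected
_, p_log = normaltest(np.log(x))  # log of the data: consistent with normal
print(p_raw < 0.001, p_log > p_raw)
```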
Power Law :
- Occurs a lot in Nature
- When a distribution follows a Power Law, it is called a Pareto Distribution
Pareto distribution :
- Continuous distribution
Examples of the Pareto Distribution in Real Life :
Sizes of sand particles
The length distribution in jobs assigned to supercomputers (a few large ones, many small ones)
The sizes of human settlements (few cities, many hamlets/villages)
File size distribution of Internet traffic which uses the TCP protocol (many smaller files, few larger ones)
Q. How to know whether something is following a Power Law or not ?
Ans : Instead of plotting X vs Y, plot log(X) vs log(Y); if you get a straight line, then it follows a Power Law. This works most of the time, but not always.
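A sketch of this log-log check on simulated Pareto data (the shape parameter a and sample size are illustrative). The empirical tail P(X > x) is proportional to x^(-a), so on a log-log scale it is a straight line with slope ~ -a:

```python
import numpy as np

rng = np.random.default_rng(11)
a = 2.5
x = rng.pareto(a, size=200_000) + 1.0  # classical Pareto(a) samples with minimum 1

xs = np.logspace(0, 1, 30)             # evaluation points from 1 to 10
tail = np.array([np.mean(x > v) for v in xs])
slope, _ = np.polyfit(np.log(xs), np.log(tail), 1)
print(round(slope, 2))                 # ~ -a = -2.5
```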
Q. How to know whether a random variable follows a Pareto Distribution ?
Ans : By using a Q-Q plot [make one axis the Pareto quantiles and the other your observations].
Box-Cox Transformation :
We have seen that, given a random variable which is log-normally distributed, we can easily convert it into a Gaussian-distributed random variable Y by taking the natural log of all data points.
The Box-Cox transformation generalizes this : it searches over a family of power transforms (parameterized by λ, with λ = 0 giving the log transform) for the one that makes the data most normal.
- In Machine Learning, most of the time we try to assume, or to convert other random variables into, Gaussian random variables, because we can derive many insights from them.
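A sketch of the Box-Cox idea with scipy.stats.boxcox on simulated log-normal data (parameters are illustrative); for log-normal data the fitted λ should come out near 0, which corresponds to the log transform:

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(13)
x = rng.lognormal(mean=0.0, sigma=0.8, size=10_000)  # positive, right-skewed data

y, lam = boxcox(x)  # finds the lambda that makes the data most normal-looking
print(round(lam, 2), round(skew(x), 2), round(skew(y), 2))
```

The transformed data y has skewness near zero, while the raw data x is strongly right-skewed.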
_____________________ See Part 2 _____________________________________________